In [ ]:
# Here we introduce data science by starting with a common regression model (logistic regression). The example uses the Iris dataset.
# We also introduce Python as we develop the model. (The Iris dataset section is adapted from an example from Analytics Vidhya.)
# Python uses some libraries, which we load first.
# numpy is used for array operations
# matplotlib is used for visualization
import numpy as np
import matplotlib as mp
from sklearn import datasets
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
In [ ]:
dataset = datasets.load_iris()
In [ ]:
# Display the data
dataset
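Displaying the whole object prints a lot at once. A short sketch like the one below may be an easier way to see what `load_iris()` returns. It assumes scikit-learn's bundled copy of the dataset and only reads attributes of the returned object (`data`, `target`, `feature_names`, `target_names`); the variable names `iris` and `samples_per_class` are chosen here for illustration.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# One row per flower, one column per measured feature
print(iris.data.shape)
# Names of the measured features (all in cm)
print(iris.feature_names)
# Species names, indexed by the integer labels stored in iris.target
print(iris.target_names)

# Count how many samples belong to each species
samples_per_class = np.bincount(iris.target)
print(dict(zip(iris.target_names, samples_per_class)))
```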
In [ ]:
# First we need to understand the data
from IPython.display import Image, HTML
Image("https://upload.wikimedia.org/wikipedia/commons/5/56/Kosaciec_szczecinkowaty_Iris_setosa.jpg")
In [ ]:
Image("http://www.opengardensblog.futuretext.com/wp-content/uploads/2016/01/iris-dataset-sample.jpg")
In [ ]:
# In statistics, linear regression is an approach for modeling the relationship between a scalar dependent variable y
# and one or more explanatory variables (or independent variables) denoted X. There are different types of regression that model the
# relationship between the independent and the dependent variables.
# In linear regression, the relationships are modeled using linear predictor functions whose unknown model
# parameters are estimated from the data. Such models are called linear models.
# In mathematics, a linear combination is an expression constructed from a set of terms by multiplying
# each term by a constant and adding the results (e.g. a linear combination of x and y would be any expression of the
# form ax + by, where a and b are constants)
# Linear regression
Image("https://www.biomedware.com/files/documentation/spacestat/Statistics/Multivariate_Modeling/Regression/regression_line.png")
In [ ]:
Image(url="http://31.media.tumblr.com/e00b481257fac723638b32271e611a2f/tumblr_inline_ntui2ohGy41sfzcxh_500.gif")
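As a rough illustration of the linear regression idea described above, the sketch below fits a straight line to synthetic data with NumPy. The data, the true slope and intercept (2 and 1), and the noise level are all made up for the example; `np.polyfit` with `deg=1` is used here as one simple way to do a least-squares line fit.

```python
import numpy as np

# Hypothetical data: y is roughly 2*x + 1 plus a little random noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Least-squares fit of a degree-1 polynomial: y ~ slope * x + intercept
# (np.polyfit returns the highest-degree coefficient first)
slope, intercept = np.polyfit(x, y, deg=1)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")

# Predict y for a new x using the fitted linear combination
x_new = 4.0
y_new = slope * x_new + intercept
print(y_new)
```

The estimated parameters should land close to the values used to generate the data, which is the sense in which the model parameters are "estimated from the data".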
We use the Iris dataset
https://en.m.wikipedia.org/wiki/Iris_flower_data_set
The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.
Logistic regression
While logistic regression gives each predictor (independent variable) a coefficient ‘b’ which measures its independent contribution to variations in the dependent variable, the dependent variable can only take on one of the two values: 0 or 1. What we want to predict from knowledge of relevant independent variables and coefficients is therefore not a numerical value of a dependent variable as in linear regression, but rather the probability (p) that it is 1 rather than 0 (belonging to one group rather than the other).
The outcome of the regression is not a prediction of a Y value, as in linear regression, but a probability of belonging to one of two conditions of Y, which can take on any value between 0 and 1 rather than just 0 and 1.
The crucial limitation of linear regression is that it cannot deal with dependent variables that are dichotomous or categorical. Many interesting variables are dichotomous: for example, consumers decide to buy or not to buy, a product may pass or fail quality control, credit risks are good or poor, an employee may be promoted or not. A range of regression techniques has been developed for analysing data with categorical dependent variables, including logistic regression and discriminant analysis.
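One way to picture how logistic regression turns a linear combination into a probability is the logistic (sigmoid) function. The sketch below is a minimal illustration for a single predictor; the coefficients `b0` and `b1` are hypothetical values picked for the example, not fitted to any data, and the 0.5 cut-off is a common convention rather than a requirement.

```python
import math

def sigmoid(z):
    """Map any real number z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for a single predictor x: z = b0 + b1 * x
b0, b1 = -4.0, 2.0

for x in [0.0, 1.0, 2.0, 3.0, 4.0]:
    z = b0 + b1 * x               # linear combination, as in linear regression
    p = sigmoid(z)                # probability that the outcome is 1
    label = 1 if p >= 0.5 else 0  # turn the probability into a class
    print(f"x={x:.1f}  p={p:.3f}  predicted class={label}")
```

The probability moves smoothly from near 0 to near 1 as x grows, and the predicted class flips where the linear combination crosses zero.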
In [ ]:
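# Iris has three species, so scikit-learn fits a multiclass logistic regression here.
# max_iter is raised above its default to give the solver more room to converge.
model = LogisticRegression(max_iter=200)
model.fit(dataset.data, dataset.target)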
In [ ]:
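# Note: we predict on the same data the model was trained on,
# so the scores below are likely to be optimistic.
expected = dataset.target
predicted = model.predict(dataset.data)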
In [ ]:
# classification_report builds a text report showing the main classification metrics
# In pattern recognition and information retrieval with binary classification,
# precision (also called positive predictive value) is the fraction of retrieved instances that are relevant,
# while recall (also known as sensitivity) is the fraction of relevant instances that are retrieved.
# Both precision and recall are therefore based on an understanding and measure of relevance.
# Suppose a computer program for recognizing dogs in scenes from a video identifies 7 dogs in a scene containing 9 dogs
# and some cats. If 4 of the identifications are correct, but 3 are actually cats, the program's precision is 4/7
# while its recall is 4/9.
# In statistical analysis of binary classification, the F1 score (also F-score or F-measure) is a measure of a test's accuracy.
# It considers both the precision p and the recall r of the test to compute the score:
# p is the number of correct positive results divided by the number of all positive results returned,
# and r is the number of correct positive results divided by the number of positive results that should have been returned.
# The F1 score is the harmonic mean of precision and recall: F1 = 2 * p * r / (p + r)
print(metrics.classification_report(expected, predicted))
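The dog-recognition example from the comments above can be worked through by hand. The sketch below uses exact fractions so the results match the 4/7 and 4/9 quoted in the text; the variable names are chosen here for illustration, and the F1 formula used is the standard harmonic mean of precision and recall.

```python
from fractions import Fraction

true_positives = 4    # dogs correctly identified
false_positives = 3   # cats wrongly identified as dogs
false_negatives = 5   # dogs in the scene that were missed (9 - 4)

# Precision: correct identifications out of everything identified as a dog
precision = Fraction(true_positives, true_positives + false_positives)
# Recall: correct identifications out of all dogs actually present
recall = Fraction(true_positives, true_positives + false_negatives)
# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
```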
In [ ]:
# Confusion matrix
# https://en.wikipedia.org/wiki/Confusion_matrix
# In the field of machine learning, a confusion matrix is a table layout that allows visualization of the performance
# of an algorithm, typically a supervised learning one.
# Each column of the matrix represents the instances in a predicted class
# while each row represents the instances in an actual class (or vice versa).
# scikit-learn's confusion_matrix puts actual classes in rows and predicted classes in columns.
In [ ]:
# If a classification system has been trained to distinguish between cats, dogs and rabbits,
# a confusion matrix will summarize the results of testing the algorithm for further inspection.
# Assuming a sample of 27 animals — 8 cats, 6 dogs, and 13 rabbits, the resulting confusion matrix
# could look like the table below:
Image("http://www.opengardensblog.futuretext.com/wp-content/uploads/2016/01/confusion-matrix.jpg")
# In this confusion matrix, of the 8 actual cats, the system predicted that three were dogs,
# and of the six dogs, it predicted that one was a rabbit and two were cats.
# We can see from the matrix that the system in question has trouble distinguishing between cats and dogs,
# but can make the distinction between rabbits and other types of animals pretty well.
# All correct guesses are located in the diagonal of the table, so it's easy to visually
# inspect the table for errors, as they will be represented by values outside the diagonal.
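The cats, dogs and rabbits example can also be written out as an array and inspected numerically. In the sketch below the cat and dog rows follow the description above; the rabbit row (2 predicted as dogs, 11 correct) is an assumption made for illustration, since the text does not spell it out. Rows are actual classes and columns are predicted classes.

```python
import numpy as np

labels = ["cat", "dog", "rabbit"]
# Rows: actual class, columns: predicted class.
cm = np.array([
    [5, 3, 0],    # 8 actual cats: 5 correct, 3 predicted as dogs
    [2, 3, 1],    # 6 actual dogs: 2 predicted as cats, 3 correct, 1 as rabbit
    [0, 2, 11],   # 13 actual rabbits (assumed split, not stated in the text)
])

correct = np.trace(cm)                               # sum of the diagonal
accuracy = correct / cm.sum()
recall_per_class = np.diag(cm) / cm.sum(axis=1)      # diagonal / row totals
precision_per_class = np.diag(cm) / cm.sum(axis=0)   # diagonal / column totals

print(f"accuracy = {correct}/{cm.sum()} = {accuracy:.3f}")
for name, p, r in zip(labels, precision_per_class, recall_per_class):
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

Dividing the diagonal by the row totals gives recall per class, and dividing it by the column totals gives precision per class, which ties the confusion matrix back to the classification report.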
In [ ]:
print(metrics.confusion_matrix(expected, predicted))
In [ ]:
import pandas as pd
We typically need the following libraries:
NumPy (Numerical Python): mainly used for n-dimensional arrays, which core Python lacks. It also contains basic linear algebra functions, Fourier transforms, advanced random number capabilities and tools for integration with lower-level languages such as Fortran, C and C++.
SciPy (Scientific Python): built on NumPy. It contains a variety of high-level science and engineering modules such as discrete Fourier transforms, linear algebra, optimization and sparse matrices.
Matplotlib: for plotting a wide variety of graphs, e.g. histograms, line plots and heat maps.
Pandas: for structured data operations and data manipulation. It is used extensively for preprocessing.
Scikit-learn: for machine learning. Built on NumPy, SciPy and matplotlib, this library contains many efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction.
Statsmodels: for statistical modeling. Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics is available for different types of data and each estimator.
Seaborn: for statistical data visualization. Seaborn is a library for making attractive and informative statistical graphics in Python. It is based on matplotlib and aims to make visualization a central part of exploring and understanding data.
Additional libraries you might need:
urllib: for web-based operations such as opening URLs
os: for operating system and file operations
networkx and igraph: for graph-based data manipulation
regular expressions (the re module): for finding patterns in text data
BeautifulSoup: for scraping the web
In [ ]:
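integers_list = [1, 3, 5, 7, 9]  # lists are written in square brackets and are mutable
print(integers_list)
tuple_integers = 1, 3, 5, 7, 9  # tuples are comma-separated values and are immutable
print(tuple_integers)
try:
    tuple_integers[0] = 11  # assigning to a tuple element is not allowed
except TypeError as err:
    print(err)  # the error message explains that tuples cannot be modified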
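To contrast the two types a little further, here is a small sketch of what a list allows and how a tuple is typically used instead. The names `readings`, `point` and `moved` are made up for the example.

```python
readings = [1, 3, 5, 7, 9]
readings[0] = 11        # lists can be changed in place
readings.append(13)     # and can grow
print(readings)

point = (2.5, 4.0)      # a tuple groups values that belong together
x, y = point            # tuple unpacking assigns each element to a name
print(x, y)

# To "change" a tuple, build a new one from the old elements
moved = (point[0] + 1.0, point[1])
print(moved)
```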
In [ ]:
# Python strings can be in single or double quotes
string_ds = "Data Science"
In [ ]:
string_iot = "Internet of Things"
In [ ]:
string_dsiot = string_ds + " for " + string_iot
In [ ]:
print(string_dsiot)
In [ ]:
len(string_dsiot)
In [ ]:
# sets are unordered collections with no duplicate elements
prog_languages = set(['Python', 'Java', 'Scala'])
prog_languages
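A few common set operations, sketched under the assumption of a second, made-up set called `data_languages`. Because sets are unordered, the union is sorted before printing so the display is stable.

```python
prog_languages = {"Python", "Java", "Scala"}      # set literal syntax
data_languages = set(["Python", "R", "Python"])   # the duplicate "Python" is dropped

print(len(data_languages))                        # number of distinct elements
print("Python" in prog_languages)                 # membership test
common = prog_languages & data_languages          # intersection
print(common)
all_languages = sorted(prog_languages | data_languages)  # union, sorted for display
print(all_languages)
```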
In [ ]:
# Dictionaries are comma-separated key-value pairs enclosed in braces
dict_marks = {'John':95, 'Mark': 100, 'Anna': 99}
In [ ]:
dict_marks['John']
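Beyond looking up a single key, dictionaries are usually updated and iterated over. The sketch below reuses `dict_marks`; the extra names ("Maria", "Peter") and marks are hypothetical values added for illustration.

```python
dict_marks = {"John": 95, "Mark": 100, "Anna": 99}

dict_marks["Maria"] = 97                   # add a new key
dict_marks["John"] = 96                    # update an existing key
missing = dict_marks.get("Peter", "absent")  # .get avoids a KeyError for missing keys
print(missing)

for name, mark in dict_marks.items():      # iterate over key/value pairs
    print(name, mark)

best = max(dict_marks, key=dict_marks.get)  # key with the highest value
print(best)
```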
In [ ]: